79 - HPC Café on February 11, 2025: Handling many small files and managing AI data sets [ID:56285]

So, welcome to today's HPC Café.

We're going to talk about handling many small files and how to manage AI data sets.

First of all, I think what we would all like is to be the sole user of the HPC clusters so that we get the full performance. It feels like being a super scientist with your own computer, and then we find out that it is actually really slow and it feels like being stuck in a traffic jam. The reason for that is that we are not alone on the cluster: there are about a thousand people using the clusters apart from ourselves.

So, some jobs can feel really slow.

We actually have a success story from support, where a user came to us after observing fluctuating job runtimes and job cancellations. The runtime of his job was around 28 hours on an NVIDIA V100, which does not really fit into our 24-hour queues, does it?

We analyzed the underlying problem, and it was basically that the user had 120 gigabytes of data in over 340,000 files, which were not even copied but accessed directly via the file server. And the reason for the bad performance is that you are not alone on the cluster; there are a lot of other people.

The solution to this problem is data staging: we combined the 340,000 files into a single archive file, which in the end was only 10 gigabytes in size, and then extracted this file directly to the node-local storage. It took 13 minutes to transfer the data. In the end, the runtime of the job was 12 hours on the NVIDIA V100, and the user was really happy about it.

So what can we learn here? We gain runtime by providing the data locally on the compute node, which is really nice, and we in support like it as well.
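To make this concrete, here is a minimal sketch of what such data staging could look like in a SLURM job script. It assumes the archive is called dataset.tar and lives in $WORK, and that $TMPDIR points to the node-local SSD; the exact paths, archive name, and resource request are assumptions for illustration, not taken from the talk.

    #!/bin/bash
    #SBATCH --time=12:00:00
    #SBATCH --gres=gpu:1

    # Stage the archived data set from the central file system
    # to the node-local SSD before the computation starts.
    cd "$TMPDIR"
    tar xf "$WORK/dataset.tar"

    # Run the training against the local copy instead of the file server.
    python train.py --data-dir "$TMPDIR/dataset"

This way, the many small files cross the network only once, as one large sequential transfer, and every later access during the job hits the fast local SSD.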

We ran some benchmarks on different data sets, and I am going to go through them quickly. We have a medium-sized data set with around 480,000 files, which in the end is about 450 gigabytes of data. We have a small data set with 90,000 files and 3 gigabytes of data, and a big data set with 140 files and 1.5 terabytes of data.

The interesting thing is that when we archive the files, we can go from 480,000 files to 21 files, which is definitely a lot less to copy. Interestingly, we can also decrease the data size if we use compression; I will go into detail a bit later. Even for big data sets, we can decrease the size by around two thirds.
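As a rough illustration of the archiving and compression step, packing a directory of many small files into one compressed archive could look like the following; the choice of zstd here is an assumption for the example, and the recommended tools are discussed later in the talk.

    # Pack the whole data set directory into one compressed archive.
    tar --zstd -cf dataset.tar.zst dataset/

    # Later, in the job, unpack it straight onto the node-local storage.
    tar --zstd -xf dataset.tar.zst -C "$TMPDIR"

Compressing once up front pays off every time the data set is staged, since both the number of files and the number of bytes going over the network shrink.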

What we benchmarked were the different file systems we have available: the node-local storage, which is an SSD; anvme, which is a Lustre-based file system that is accessible via InfiniBand and offers high-performance I/O; and atuin, which you probably all know as the work file system, the central NFS server used for short- and mid-term storage and as a general-purpose file system.

These all have different latencies and bandwidths. Why would I want to access data over the network, over file systems that are really slow, when I can access the data really fast on the node-local storage? It is pretty obvious to me.

Part of a video series:
Part of a chapter: HPC Café

Accessible via: Open access

Duration: 00:31:48 min

Recording date: 2025-02-11

Uploaded on: 2025-02-14 16:46:04

Language: en-US

Speaker: Dr. Anna Kahler, NHR@FAU

Slides: https://hpc.fau.de/files/2025/02/HPC-Cafe-Small-Files-AI-DataSets-Feb-11-2025.pdf

Abstract:

We invite you to join us for a discussion on data handling, including the possibility of NHR@FAU providing access to popular data sets. As part of this discussion, we will present an overview of the various file systems available for data storage at NHR@FAU, covering key topics such as data archive formats, data copying, archiving, compressing, and unpacking, as well as recommendations for the most effective programs to use. Additionally, we will share best practices for efficient data storage and access in your SLURM scripts.

By taking a few simple steps, many common data handling issues can be resolved, which is crucial given that NHR@FAU supports over 1,000 users and inefficient data usage can impact not only individual workflows but also those of colleagues. Despite the importance of this issue, we continue to observe inefficient data handling practices and believe it is essential to revisit this topic time and again.

Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/
